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Abstract 

We consider the problem of learning a linear factor model with an unknown number of factors. We 
propose a regularized form of principal component analysis (PCA) and demonstrate through experiments 
with synthetic and real data the superiority of resulting estimates to those produced by pre-existing factor 
analysis approaches. We also establish theoretical results that explain how our algorithm corrects the 
biases induced by conventional approaches. An important feature of our algorithm is its computational 
efficiency being close to that of PCA, which enjoys wide use in large part due to its efficiency. 

1 Introduction 

Linear factor models have been widely used for a long time and wit h notable succe ss in economics, finance, 
medicine, psychology, and various other natural and social sciences ( Harmanl . Il976l ). In such a model, each 



observed variable is a linear combination of unobserved common factors plus idiosyncratic noise, and the 
collection of random variables is jointly Gaussian. We consider in this paper the problem of learning a factor 
model from a training set of vector observations. In particular, our learning problem entails simultaneously 
estimating the number of factors, the loadings of each factor, and the residual variance of each variable. We 
seek an estimate of these parameters that best explains out-of-sample data. For this purpose, we consider 
the likelihood of test data that is independent of the training data. As such, our goal is to design a learning 
algorithm that maximizes the likelihood of a test set that is not used in the learning process. 

A common approach to factor model learning involves application of principal component analysis (PCA). 
If the number of factors is known and residual variances are assumed to be uniform, PCA can be applied 
to eff iciently compute model parameters that maximize likelihood of the training data ( Tipping and Bishopl . 



Il999h . In order to simplify analysis, we begin our study with a context for which PCA is ideally suited. 
In particular, before treating more general models, we will restrict attention to models in which residual 
variances are uniform. As a baseline among learning algorithms, we consider applying PCA together with 
cross-validation, computing likelihood-maximizing parameters for different numbers of factors and selecting 
the number of factors that maximizes likelihood of a portion of the training data that is reserved for validation. 
We will refer to this baseline as uniform-residual rank- constrained maximum-likelihood (URM) estimation. 

To improve on URM, we propose uniform-residual trace-penalized maximum-likelihood (UTM) estimation. 
Rather than estimating parameters of a model with a fixed number of factors and iterating over the number 
of factors, this approach maximizes likelihood across models without restricting the number of factors but 
instead penalizing the trace of a matrix derived from the model's covariance matrix. This trace penalty 
serves to regularize the model and naturally selects a parsimonious set of factors. The coefficient of this 
penalty is chosen via cross-validation, similarly with the way in which the number of factors is selected by 
URM. Through a computational study using synthetic data, we demonstrate that UTM results in better 
estimates than URM. In particular, we find that UTM requires as little as two-thirds of the quantity of 
data used by URM to match its performance. Further, leveraging recent work on random matrix theory, we 
establish theoretical results that explain how UTM corrects the biases induced by URM. 

We then extend UTM to address the more general and practically relevant learning problem in which 
residual variances are not assumed to be uniform. To evaluate the resulting algorithm, which we refer to as 
scaled trace-penalized maximum-likelihood (STM) estimation, we carry out experiments using both synthetic 
data and real stock price data. The computational results demonstrate that STM leads to more accurate 
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estimates than alternatives available from prior art. We also provide an analysis to illustrate how these 
alternatives can suffer from biases in this nonuniform residual variance setting. 

Aside from the aforementioned empirical and theoretical analyses, an important contribution of this 
paper is in the design of algorithms that make UTM and STM efficient. The UTM approach is formulated 
as a convex semidefinite program (SDP), which can be solved by existing algorithm s such as interior-point 
methods or alternating direction method of multipliers (see, e.g.. IBovd et al However, when data 

dimension is large, as is the case in many contexts of practical import, such algorithms can take too long 
to be practically useful. This exemplifies a recurring obstacle that arises in the use of SDP formulations 
to study large data sets. We propose an algorithm based on PCA that solves UTM formulation efficiently. 
In particular, we found this method to typically require three orders of magnitude less compute time than 
alternating direction method of multipliers. Variations of PCA such as URM have enjoyed wide use to a 
large extent because of their efficiency, and the computation time required for UTM is essentially the same 
as that of URM. STM requires additional computation but remains in close reach for problems where the 
computational costs of URM are accept able. 

Our formulation is related to that of Chandrasekaran et al. (2010), which estimates a factor model using 
a similar t race penalty. There are some important differences, however, that distinguish our work. First, the 
analysis of Chandrasekaran et al. ( 2010t ) focuses on establishing perfect recovery of structure in an asymptotic 
regime, whereas our work makes the point that this trace penalty reduces nonasymptotic bias. Second, 
our approach to dealing with nonuniform residual variances is distinctive and we demonstrate through 
computational and theoretical analysis that this difference reduces bias. Third. [Chandrasekaran et al. (2010) 
treats the problem as a semidefinite program, whose solution is often computationally demanding when data 
dimension is large. We provide an algorithm bas ed on PCA that efficien t ly solv es our problem. The algorithm 
can also be adapted to solve the formulation of Chandrasekaran et al. ( 2010l) , though that is not the aim of 
our paper. 

In addition, there is another thread of research on regulariz ed maximum-likelihoo d estimation for covari- 
ance matrices that relates loosely to this paper. Along this line. lBanerjee et al.l (|2008f ) regularizes maximum- 
likelihood estimation by the l\ norm of the inverse covariance matrix in or der to recover a s parse graphical 
model. An efficient algorithm called graphical Lasso was then proposed by Friedman et al. ( 2008f) for solv- 
i ng th is formulation. Similar formulations can also be found in Yuan and Linl ( 2007 ) and Ravikumar et al 



([20111 ). who instead penalize the l\ norm of off-diagonal elements of the inverse covariance matrix when 
computing maximum-likelihood estimates. Although our approach shares some of the spirit reflected in this 
line of research in that we also regularize maximum-likelihood estimation by an ^i-like penalty, the settings 
are fundamentally different: while ours focuses on a factor model, theirs are based on sparse graphical mod- 
els. We propose an approach that corrects the bias induced by conventional factor analysis, whereas their 
results are mainly concerned with accurate recovery of the topology of an underlying graph. As such, their 
work does not address biases in covariance estimates. On the algorithmic front, we develop a simple and 
efficient solution method that builds on PCA. On the contrary, their algorithms are more complicated and 
computationally demanding. 



2 Problem Formulation 

We consider the problem of learning a factor model without knowledge of the number of factors. Specifically, 
we want to estimate a M x M covariance matrix from samples x/n, . . . ,X(jy) ~ A/"(0, £*), where X* is 

the sum of a symmetric matrix >z and a diagonal matrix R* >: 0. These samples can be thought of 

i 

as generated by a factor model of the form X( n ) = F* z/ n ) + W( n ), where Zr n \ ~ Af(0,T) represents a set of 
common factors and W(„) ~ A/"(0,R*) represents residual noise. The number of factors is represented by 
rank(F*), and it is usually assumed to be much smaller than the dimension M. 

Our goal is to produce based on the observed samples a factor loadings matrix F y and a residual 
variance matrix R >z such that the resulting factor model best explains out-of-sample data. In particular, 
we seek a pair of (F, R) such that the covariance matrix £ = F + R maximizes the average log-likelihood of 
out-of-sample data: 

L(E,£,)- E [logp(x|S)]^-i(Aflog(27r)+logdet(S)+tr(S- 1 SO). 
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This is also equivalent to minimizing the Kullback-Leibler divergence between Af(0, S*) and 7V(0, £). 



3 Learning Algorithms 

Given our objective, one simple approach is to choose an estimate S that maximizes in-sample log-likelihood: 

logp(*|£) = ™ (A/log(2^) +logdet(S) +tr(S- 1 S SAM )) , (1) 

where X = {x^j, . . . , x^)}, and we use Esam = Y^n=i x (») x («)/-^ to denote the sample covariance matrix. 
Here, the maximum likelihood estimate is simply given by X = Ssam- 

The problem with maximum likelihood estimation in this context is that in-sample log-likelihood does not 
accurately predict out-of-sample log-likelihood unless the number of samples N far exceeds the dimension 
M. In fact, when the number of samples TV is smaller than the dimension M, Ssam is ill-conditioned and the 
out-of-sample log-likelihood is negative infinity. One remedy to such poor generalization involves exploiting 
factor structure, as we discuss in this section. 



3.1 Uniform Residual Variances 

We begin with a simplified scenario in which the residual variances are assumed to be identical. As we will 
later see, such simplification facilitates theoretical analysis. This assumption will be relaxed in the next 
subsection. 



3.1.1 Constraining the Number of Factors 



Given a belief that the data is generated by a factor model with few factors, one natural approach is 
to employ maximum likelihood estimation with a constraint on the number of factors. Now suppose the 
residual variances in the generative model are identical, and as a result we impose an additional assumption 
that R is a multiple a 2 l of the identity matrix. This leads to an optimization problem 



max 

s.t. 



logp(*|E) 

£ = F + a 2 I 
rank(F) < K 



(2) 



where Sip denote the set of all M x M positive semidefinite symmetric matrices, and K is the exogenously 
specified number of factors. In this case, we can efficiently comp u te an analytical solution via principal 
component analysis (PCA), as established in Tipping and Bishop! ( 19991 ). This involves first computing 

■ bjvf] is 



an eigendecomposition of the sample covariance matrix 
orthonormal and S = diag(si, . . . , sjj) with s\ > . . . > Sm 



Ssam = BSB 1 , where B = [bi . 
. The solution to is then given by 



a 2 



1 



M - K 



M 

£ ■ 

=K+1 



K 



Y> fe -a 2 )b fe bT. 



(3) 



fc=i 



In other words, the estimate for residual variance equals the average of the last M — K sample eigenvalues, 
whereas the estimate for factor loading matrix is spanned by the top K sample eigenvectors with coefficients 
Sfc — a 2 . We will refer to this method as uniform-residual rank- constrained maximum-likelihood estimation, 
and use Sy RM = F + a 2 l to denote the covariance matrix resulting from this procedure. It is easy to see 
that the eigenvalues of Sy RM are Si, . . . , sk, & 2 , . . . ,a 2 , as illustrated in Figure [TJa). 

A numbe r of methods have been pro posed for estimating the number of factors K (jAkaikd . Il987t iBishop . 

1998; Minka, 2000; iHirose et all 120111 ). Cross-validation provides a conceptually simple approach that in 
practice works at least about as well as any other. To obtain best performance from such a procedure, one 
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would make use of so-called n-fold cross-validation. To keep things simple in our study and comparison of 
estimation methods, for all methods we will consider, we employ a version of cross-validation that reserves 
a single subset of data for validation and selection of K. Details of the procedure we used can be found in 
the appendix. Through selection of K, this procedure arrives at a covariance matrix which we will denote 
by Surm- 



3.1.2 Penalizing the Trace 

Although ([2]) can be elegantly solved via PCA, it is unclear that imposing a hard constraint on the number 
of factors will lead to an optimal estimate. In particular, one might suspect a "softer" regularization could 
improve estimation accuracy. Motivated by this idea, we propose penalizing the trace instead of constraining 
the rank of the factor loading matrix. As we shall see in the experiment results and theoretical analysis, 
such an approach indeed improves estimation accuracy significantly. 

Nevertheless, naively replacing the rank constraint of ([2]) by a trace constraint tr(F) < t will result in a 
non-convex optimization problem, and it is not clear to us whether it can be solved efficiently. Let us explore 
a related alternative. Some straightforward matrix algebra shows that if X = F + er 2 I with F <E S* 1 and 
a 2 > 0, then the matrix defined by G = cr~ 2 I — is in S¥, with rank(G) = rank(F). This observation, 
together with the well-known fact that the log-likelihood of X is concave in the inverse covariance matrix 
motivates the following convex program: 

max logp(A"|S) 

s.t. E _1 =t;I-G 
tr(G) < t. 

Here, the variable v represents the reciprocal of residual variance. Pricing out the trace constraint leads to 
a closely related problem in which the trace is penalized rather than constrained: 

max logp(ATlE) - Atr(G) (4) 

Ges* r ,«eR+ 

s.t. X 1 =vl-G. 

We will consider the trace penalized problem instead of the trace constrained problem because it is more 
convenient to design algorithms that address the penalty rather than the constraint. Let (G, v) be an optimal 
solution to and let EyTM = (vl — G) _1 denote the covariance matrix estimate derived from it. Here, 
the "U" indicates that residual variances are assumed to be uniform across variables and "T" stands for 
trace-penalized. 

It is easy to see that Q is a semidefinite program. As such, the problem can be solved in polynomial 
time by existing algorithms such as interior-point methods or alternating direction method of multipliers 
(AD MM). However, when the number of variables M is large, as is the case in many contexts of practical 
import, such algorithms can take too long to be practically useful. One contribution of this paper is an 
efficient method for solving (|4]), which we now describe. The following result motivates the algorithm we 
will propose for computing Sy TM : 

Theorem 1 Ssam and Sy TM share the same trace and eigenvectors, and letting the eigenvalues of the two 
matrices, sorted in decreasing order, be denoted by s\, . . . , sj\/ and hi, ... , h,M, respectively, we have 

h m = max |s m - i j , for m = 1, . . . , M. (5) 

This theorem suggests an algorithm for computing Sy TM . First, we compute the eigendecomposition of 
Ssam = BSB T , where B and S are as defined in Section 13.1.11 This provides the eigenvectors and trace of 
Sy TM . To obtain its eigenvalues, we only need to determine the value of v such that the eigenvalues given 
by ([3]) sum to the desired trace. This is equivalent to determining the largest integer K such that 

2A 

SK ~N > M 
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To see this, note that setting 



\ m=K+l / 

f s m - , m =l,...,K 
\ v^ 1 ,m = K+1,...,M 

uniquely satisfies ([3]) and ensures 53m=i ^™ = El=i s ™- Algorithm [T] presents this method in greater detail. 
In our experiments, we found this method to typically require three orders of magnitude less compute time 
than ADMM. For example, it can solve a problem of dimension M = 1000 within seconds on a workstation, 
whereas ADMM requires hours to attain the same level of accuracy. 

Also note that for reasonably large A, this algorithm will flatten most sample eigenvalues and allow only 
the largest eigenvalues to remain outstanding, effectively producing a factor model estimate. Figure QTb) 
illustrates this effect. Comparing Figure Ufa) and Figure Hfb), it is easy to see that URM and UTM primarily 
differ in the largest eigenvalues they produce: while URM simply retains the largest sample eigenvalues, UTM 
subtracts a constant 2X/N from them. As we shall see in the theoretical analysis, this subtraction indeed 
corrects the bias incurred in sample eigenvalues. 





position position 

(a) (b) 



Figure 1: (a) An example of sample eigenvalues and the corresponding eigenvalues of URM estimate, with 
M = 5 and K = 2. URM essentially preserves the top eigenvalues and averages the remaining ones as residual 
variance, (b) An example of sample eigenvalues and the corresponding eigenvalues of UTM estimate. With a 
particular choice of A, this estimate has two outstanding eigenvalues, but their magnitudes are 2X/N below 
the sample ones. Its residual variance is determined in a way that ensures the summation of UTM 
eigenvalues equal to the sample one. 

Like URM, the most computationally expensive step of UTM lies in the eigendecomposition. Beyond 
that, the evaluation of UTM eigenvalues for any given A takes O(M), and is generally negligible. In our 
implementation, this regularization parameter A is chosen by cross-validation from a range around Ma 2 , 
whose reasons will become apparent in Section 15.11 We denote by Sjjtm the covariance matrix resulting 
from this cross-validation procedure. 

3.2 Nonuniform Residual Variances 

We now relax the assumption of uniform residual variances and discuss several methods for the general case. 
As in the previous subsection, the hyper-parameters of these methods will be selected by cross-validation. 

1 In fact, a full eigendecomposition is not required, as we only need the top K eigenvectors to compute the estimate. 
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Algorithm 1 Procedure for computing Sutm 
Input: X, A 
Output: EyiM 

Compute eigendecomposition Ssam = BSB T 

«* 1 <" Mh { k ■ W + Ei= fe+ i s m ) , Vfc = 0, 1, . . . , M - 1 

K <— max {fc : Sfc — ^ > v^ 1 } / / define s = oo 

v 4— Vk 

h m ^{ !V f , Vm = l,...,M 

\ y- 1 otherwise 



3.2.1 Constraining the Number of Factors 

Without the assumption of uniform residual variances, the ranked-constrained maximum-likelihood formu- 
lation can be written as 

max logp(AflS) (6) 

s.t. S = F + R 
rank(F) < K 

where denote the set of all M x M positive semidefinite diagonal matrices. Unlike ©, this formulation 
is generally hard to solve, and therefore we consider two widely- used approxi mate solutions. 



mm 

The expectation-maximization (EM) algorithm (jRubin and Thaveii I1982T) is arguably the most conven 



tional approach to solving ([6]) , though there is no guarantee that this will result in a global optimal solution. 
The algorithm generates a sequence of iterates F? € M. MxK and R € D* f , such that the covariance matrix 

1 T 

S = F2F~+R increases the log- likelihood of X with each iteration. Each iteration involves an estimation 

1 T 

step in which we assume the data are generated according to the covariance matrix X = F2F2" +R, and 
compute expectations 

Efz (n) |x (n) ] = (ijc + FiR- 1 ^) FSR-^ 

-1 



E [ z (n) z (n)l x (n)] = (lif + F^R + E[z (n) |x ( „ ) ]E[z (n) |x (ll) ] T , 

for n = 1, . . . , N. A maximization step then updates F and R according to 



Rm,m <- ( S SA M - F^ — ^ E[z (n) |x (n )]x^ l) J 

V n— 1 / rn.m 

for m = 1, . . . , M. In our implementation, the initial F and R are selected by the MRH algorithm described 
in the next paragraph. We will denote the estimate produced by the EM algorithm by Sg M and that 
resulting from further selection of K through cross-validation by Sem- 

A common heuristic for approximately solving without entailing iterative computation is to first 
compute Sy RM by PCA and then take the factor matrix estimate to be the F defined in §5§ and the residual 

variances to be R m , m = ( Ssam — F ) , for m = 1, . . . , M. In other words, R m , m is selected so that the 

V / m,m 

diagonal elements of the estimated covariance matrix £ = F + R are equal to those of the sample covariance 
matrix. We will refer to this method as marginal-variance-preserving rank- constrained heuristic and denote 
the resulting estimates by S^ RH and Smrh- 



G 



3.2.2 Penalizing the Trace 



We now develop an extension of Algorithm [T] that applies when residual variances are nonuniform. One 
formulation that may seem natural involves replacing vl in Q with a diagonal matrix V € B^. That is, 

max logp(#|E) - Atr(G) (7) 

s.t. s * = v g. 



Indeed, a closely related formulation is proposed in Chandrasekaran et al. ( 2010t ). However, as we will see in 



Sections @] and [5j solutions to this formulation suffer from bias and do not compete well against the method 
we will propose next. That said, let us denote the estimates resulting from solving this formulation by S^, M 
and Stm, where "T" stands for trace-penalized. Also note that this formulation can be efficiently solved by 
a straightforward generalization of Theorem [TJ though we will not elaborate on this. 

Our approach involves componentwise scaling of the data. Consider an estimate S of X*. Recall that 
we evaluate the quality of the estimate using the expected log-likelihood L(E, £*) of out-of-sample data. If 
we multiply each data sample by a matrix T € M x , the data set becomes TX = {Txm, . . . , Tx^)}, 
where Tx (n) ~ Af(0, TS*T T ). If we also change our estimate accordingly to TST T then the new expected 
log-likelihood becomes 

L(TST T , T£*T T ) = L(S, 53, ) — logdet T. 

Therefore, as long as we constrain T to have unit determinant, L(TST T , TS*T T ) will be equal to L(S, S*), 
suggesting that if X is a good estimate of then TET T is a good estimate of TX„ C T T . This motivates the 
following optimization problem: 

max logj?(TAf|S) - Atr(G) (8) 

s.t. S x =wl-G. 
logdetT > 0. 

The solution to this problem identifies a componentwise-scaling matrix T € that allows the data to be 
best-explained by a factor model with uniform residual variances. Given an optimal solution, 1/T^ should be 
approximately proportional to the residual variance of the ith variable, so that scaling by T^j makes residual 
variances uniform. Note that the optimization problem constrains log det T to be nonnegative rather than 
zero. This makes the feasible region convex, and this constraint is binding at the optimal solution. Denote 
the optimal solution to ([8]) by (G, v, T). Our estimate is thus given by T~ 1 (-0I — G)~ lr T~ T . 

The objective function of ([8]) is not concave in (G, v, T), but is biconcave in (G, v) and T. We solve it by 
coordinate ascent, alternating between optimizing (G, v) and T. This procedure is guaranteed convergence. 
In our implementation, we initialize T by I. We will denote the resulting estimates by Sstm, where "ST" 
stands for scaled and trace-penalized. 



4 Experiments 

We carried out two sets of experiments to compare the performance of aforementioned algorithms. The first 
is based on synthetic data, whereas the second uses historical prices of stocks that make up the S&P 500 
index. 



4.1 Synthetic Data 

We generated two kinds of synthetic data. The first was generated by a model in which each residual has 
unit variance. This data was sampled according to the following procedure, which takes as input the number 
of factors if*, the dimension M, the factor variances a% and the number of samples N: 

1. Sample orthonormal vectors (/)%, 02, • • • 5 4>k* S K m isotropically. 

2. Sample f u h,...,f K . ~AA(0,a 2 f ). 
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3. Let FI = [frfa / 2 2 ... !kAkX 

4. Let E, = FJF? +1. 

5. Sample X(!), ... ,X(jv) iid from A/"(0, E*). 

We repeated this procedure one hundred times for each N € {50, 100, 200, 400}, with M — 200, — 10, and 
df = 5. We applied to this data URM and UTM, since they are methods designed to treat such a scenario 
with uniform residual variances. Regularization parameters K and A were selected via cross-validation, 
where about 70% of each data set was used for training and 30% for validation. Figure HJa) plots out-of- 
sample log-likelihood delivered by the two algorithms. Performance is plotted as a function of the log-ratio 
of the number of samples to the number of variables, which represents the availability of data relative to 
the number of variables. We expect this measure to drive performance differences. UTM outperforms URM 
in all scenarios. The difference is largest when data is scarce. When data is abundant, both methods work 
about as well. This should be expected since both estimation methods are consistent. 

To interpret this result in a more tangible way, we also plot the equivalent data requirement of UTM 
in Figure El^b). This metric is defined as the portion of training data required by UTM to match the 
performance of URM. As we can see, UTM needs as little as 67% of the data used by the URM to reach the 
same estimation accuracy. 
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Figure 2: (a) The average out-of-sample log-likelihood delivered by URM and UTM, when residual variances 
are identical, (b) The average portion of data required by UTM to match the performance of URM. The 
error bars denote 95% confidence interval. 



Our second type of synthetic data was generated using an entirely similar procedure except step 4 was 
replaced by 

E* = F?F? + diag(e ri , e r2 , . . . , e™ ) , 

where rx,...,rM were sampled iid from J\f(0,of)- Note that oy effectively controls the variation among 
residual variances. Since these residual variances are nonuniform, EM, MRH, TM, and STM were applied. 
Figure [3] plots the results for the cases oy = 0.5 and oy = 0.8, corresponding to moderate and large variation 
among residual variances, respectively. In either case, STM outperforms the alternatives. Figure |4] further 
gives the equivalent data requirement of STM with respect to each alternative. It is worth pointing out that 
the performance of MRH and TM degrades significantly as the variation among residual variances grows, 
while EM is less susceptible to such change. We will elaborate on this phenomenon in Section l5~2l 



4.2 S&P 500 Data 

An important application area of factor ana lysis is finance, where return covariances are used to assess risks 
and guide diversification ( Markowitd . 1952 ). The experiments we will now describe involve estimation of 
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Figure 4: The equivalent data requirement of STM with respect to EM, MRH, and TM when (a) o> = 0.5 
and (b) o r = 0.8. The error bars denote 95% confidence interval. 



such covariances from historical daily returns of stocks represented in the S&P 500 index as of March, 2011. 
We use price data collected from the period starting November 2, 2001, and ending August 9, 2007. This 
period was chosen to avoid the erratic market behavior observed during the bursting of the dot-com bubble 
in 2000 and the financial crisis that began in 2008. Normalized daily log-returns were computed from closing 
prices through a process described in detail in the appendix. Over this duration, there were 1400 trading 
days and 453 of the stocks under consideration were active. This produced a data set y = {y(i), ■ ■ • , y(i4oo)}> 
in which the ith component of y( t ) € M 453 represents the normalized log-daily-return of stock i on day t. 

We generated estimates corresponding to each among a subset of the 1400 days. As would be done in 
real-time application, for each such day t we used TV data points {y( t _Ar + i), . . . , Y(t)} that would have been 
available on that day to compute the estimate and subsequent data to assess performance. In particular, we 
generated estimates every ten days beginning on day 1200 and ending on day 1290. For each of these days, 
we evaluated average log-likelihood of log-daily- returns over the next ten days. Algorithm [2] formalizes this 
procedure. 

These tests served the purpose of sliding-window cross-validation, as we tried a range of regularization 
parameters over this time period and used test results to select a regularization parameter for each algorithm. 
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Algorithm 2 Testing Procedure T 

Input: learning algorithm U, regularization parameter 9, window size N, time point t 
Output: test-set log-likelihood 

X <- {y( t _jv+i), • ■ • ,y(t)l (training set) 
X' <- {y(t + i), • • • ,y(t+io)} (test set) 

£<-u(x,d) 

return logp(A?'|E) 



More specifically, for each algorithm U, its regularization parameter was selected by 

9 

§ = argmax V T{U, 9, N, 1200 + Wj). 

3=0 

On days 1300, 1310, . . . , 1390, we generated one estimate per day per algorithm, in each case using the 
regularization parameter selected earlier and evaluating average log-likelihood over the next ten days. For 
each algorithm U, we took the average of these ten ten-day averages to be its out-of-sample performance, 
defined as 

1 9 

— ]T T(W, 9,N, 1300 + 10.7). 

j=o 

Figure [5] plots the performance delivered EM, MRH, TM, and STM with N G {200, 300, . . . , 1200}. STM 
is the dominant solution. It is natural to ask why the performance of each algorithm improves then degrades 
as N grows. If the time series were stationary, one would expect performance to monotonically improve with 
AT. However, this is a real time series and is not necessarily stationary. We believe that the distribution 
changes enough over about a thousand trading days so that using historical data collected further back 
worsens estimates. This observation points out that in real applications STM is likely to generate superior 
results even when all aforementioned algorithms are allowed to use all available data. This is in contrast 
with the experiments of Section 14.11 involving synthetic data, which may have led to an impression that the 
performance difference could be made small by using more data. 
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Figure 5: The average log-likelihood of test set, delivered by EM, MRH, TM, and STM, over different 
training- window sizes N. 
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5 Analysis 



In this section, we explain why UTM and STM are expected to outperform alternatives as we have seen in 
our experimental results. 



5.1 Uniform Residual Variances 

Let us start with the simpler context in which residual variances are identical. In other words, let X* = 
+ er 2 I for a low rank matrix F* and uniform residual variance a 2 . We will begin our analysis with two 

desirable properties of UTM, and then move on to the comparison between UTM and URM. 

It is easy to see that Sy TM is a consistent estimator of for any A > 0, since limjv^oo ^ = and by 

Theorem Q] we have 

lim Su TM = lim £ S am ^> E*- 

TV— ^oo TV— >oo 

Another important property of UTM is the fact that the trace of UTM estimate is the same as that of sample 
covariance matrix. This preservation is desirable as suggested by the following result. 

Proposition 1 For any fixed N, K , scalars £\ > £2 > • • • > £k — & 2 > 0, and any sequence of covariance 
matrices Si* G S* f with eigenvalues £\, . . . , £k , o~ 2 , ■ ■ . , o 2 , we have 

tr^sAM a.s. ^ 



and 

Fr 



trE„ 



1 



tr£* 



Note that, for any fixed fixed number of samples N, the right-hand-side of (|9]) diminishes towards as data 
dimension M grows. In other words, as long as the data dimension is large compared to the number of 
factors K, the sample trace is usually a good estimate of the true one, even when we have very limited data 
samples. 

Now we would like to understand why UTM outperforms URM. Recall that, given an eigendecomposition 
^SAM = BSB T of the sample covariance matrix, estimates generated by URM and UTM admit eigendecom- 
positions Surm = BHurmB t and Sutm = BHutmB t , deviating from the sample eigenvalues S but not 
the eigenvectors B. Hence, URM and UTM differ only in the way they select eigenvalues: URM takes each 
eigenvalue to be either a constant or the corresponding sample eigenvalue, while UTM takes each eigenvalue 
to be either a constant or the corresponding sample eigenvalue less another constant. Thus, large eigenvalues 
produced by UTM are a constant offset less than those produced by URM, as illustrated in Figure [T] We 
now explain why such subtraction lends UTM an advantage over URM in high-dimensional cases. 

Given the eigenvectors B = [bi ■ ■ -bj\/], let us consider the optimal eigenvalues that maximize out-of- 
sample log-likelihood of the estimate. Specifically, let us define 

H* = argmaxL(BHB T ,S >t ). 

With some straightforward algebra, we can show that H* = diag(/i|, . . . , h* M ), where h* = b^S^b^, for 
1 = 1,..., M . Let each ith sample eigenvalue be denoted by Sj = S^j, and let the ith largest eigenvalue of 
X* be d enoted by lj. The following theo rem, whose p roof relies on two results from random matrix theory 
found in iBaik and Silverstein (|2006l) and Haul (|2007l ). relates sample eigenvalues s, to optimal eigenvalues 

h* 

Theorem 2 For all K, scalars i\ > £2 > ■ ■ • > £k > & 2 > 0, p G (0,1), sequences N(m) such that 
|M/7V( A /) — p\ = o(l/ N(m))> covariance matrices si M ^ € with eigenvalues £\, . . . ,£k,& 2 , ■ ■ ■ ,& 2 , and 
i such that £i > (1 + y / p)c 2 , there exists e$ G (0, 2a 2 j(£i — a 2 )) such that h* — — > Sj — (2 + e.i)po~ 2 . 
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Consider eigenvalues £i that are large enough so that ej is negligible. In such cases, when in the asymptotic 
regime identified by Theorem [2] we have h* f» Sj — 2 pa 2 . This observation suggests that, when the number 
of factors K is relatively small compared to data dimension M, and when M and number of samples N scale 
proportionally to large numbers, the way in which UTM subtracts a constant from large sample eigenvalues 
should improve performance relative to URM, which does not modify large sample eigenvalues. Furthermore, 
comparing Theorem [1] and El we can see that the correction term should satisfy ^ ~ 2pcr 2 , or equivalently 
A ~ Ma 2 . This relation can help us narrow the search range of A in cross-validation. 

It is worth poi nting out that th e over-shooting effect of sample eigenvalues is well known in statistics 
literature (see, e.g, Johnstone] ( 2001 ) ). Our contribution, however, is to quantify this effect for factor models, 



and show that the large eigenvalues are not only biased high, but biased high by the same amount. 
5.2 Nonuniform Residual Variances 

Comparing Figure El El and HI we can see that the relation between URM and UTM is analogous to that 
between EM and STM. Specifically, the equivalent data requirement of UTM versus URM behaves very 
similarly as that of STM versus EM. This should not be surprising, as we now explain. 

To develop an intuitive understanding of this phenomenon, let us consider an idealized, analytically 
tractable context in which both EM and STM successfully estimate the relative magnitudes of residual 
variances. In particular, suppose we impose an additional constraint R oc R* into ([5]) and an additional 

constraint T oc R* 2 into (jHJ. El In this case, it is straightforward to show that EM is equivalent to URM 

with data scaled by R* 2 , and STM is equivalent to UTM with the same scaled data. Therefore, by the 
argument given in Section [5Tl it is natural to expect STM outperforms EM. 

A question that remains, however, is why MRH and TM are not as effective as STM. We believe the reason 
to be that they tend to select factor loadings that assign larger values than appropriate to variables with large 
residual variances. Indeed, such disadvantage has been observed in our synthetic data experiment: when 
the variation among residual variances increases, the performances of MRH and TM degrade significantly, 
as shown in Figure El 

Again, let us illustrate this phenomenon through an idealized context. Specifically, consider a case 
in which the sample covariance matrix Xsam turns out to be identical to X» = + R*, with R* = 
diag(r, 1, 1, . . . , 1) and F* = 11 T , where 1 is a vector with every component equal to 1. Recall that MRH 
uses the eigenvectors of Xsam corresponding to the largest eigenvalues as factor loading vectors. One would 
hope that factor loading estimates are insensitive to underlying residual variances. However, the following 
proposition suggests that, as r grows, the first component of the first eigenvector of Xsam dominates other 
components by an unbounded ratio. 

Proposition 2 Suppose R* = diag(r, 1, 1, . . . , 1) and = [1 1 ... 1] T [1 1 •■• 1], r > 1. Let 

f = [fi ... /a/] T be the top eigenvector o/X*. Then we have fi/fi — f2(r),Vi > 1. 

As such, the factor estimated by this top eigenvector can be grossly misrepresented, implying MRH is not 
preferable when residual variances differ significantly from each other. 

TM suffers from a similar problem, though possibly to a lesser degree. The matrix V in the TM formu- 
lation represents an estimate of R" 1 . For simplicity, let us consider an idealized TM formulation which 
further incorporates a constraint V = R^T 1 . That is, 

max \ogp(X\T,) - Atr(G) (10) 

s.t. X _1 =V-G 

Xsam = X* 
V = R^ 1 . 

Using the same setting as in Proposition El we can show that when this idealized TM algorithm produces 
an estimate of exactly one factor as desired, the first component of the estimated factor loading vector is 
strictly larger than the other components, as formally described in the following proposition. 



2 Here we use the notation A oc B to mean that there exists 7 > such that A = 7B. 
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Proposition 3 Suppose R* and F* are given as in Proposition^ and let the estimate resulting from I10|) 
feE = R* + F. Then for all A > we have 

1. rank(F) = 1 if and only if A < MN/2. 

2. In that case, if we rewrite F as ff T , where f = [/i ... /m] T j iften Vi > 1, «s greater than 1 
and monotonically increasing with r. Furthermore, if A > (M — l)N/2, then f\jf% = Q(r). 

Again, this represents a bias that overemphasizes the variable with large residual variance, even when we 
incorporate additional information into the formulation. On the contrary, it is easy to see that STM can 
accurately recover all major factors if similar favorable constraints are incorporated into its formulation (ie., 

if we set S SA m = £* and T oc R^ 3 in © ). 



6 Conclusion 



We propose factor model estimates UTM and STM, both of which are regularized versions of those that would 
be produced via PCA. UTM deals with the context where residual variances are assumed to be uniform, 
whereas STM handles variation among residual variances. Our algorithm for computing the UTM estimate 
is as efficient as conventional PCA. For STM, we provide an iterative algorithm with guaranteed convergence. 
Computational experiments involving both synthetic and real data demonstrate that the estimates produced 
by our approach are significantly more accurate than those produced by pre-existing methods. Further, we 
provide a theoretical analysis that elucidates the way in which UTM and STM corrects the biases induced 
by alternative approaches. 

Let us close by mentioning a few directions for potential extensions. We have used out-of-sample log- 
likelihood to assess accuracy of estimates. In practice, estimates are used to guide subsequent decisions. It 
would be interesting to study the impact of STM on decision quality and whether there are other approaches 
that fare better in this dimension. Further, in some applications, PCA is used to identify a subspace for 
dimension reduction. It would be interesting to understand if and when the subspace identified by STM is 
more suitable. Finally, there is a growing body of research on robust variations of factor analysis and PCA 
Thes e include the pursui t of sparse factor loadings (Ijolliffe et all . 12003 : Zou et al. . 2004 : D'Aspremont et al 



2004 : Johnstone and Lu . 2007 : Amini and Wainwrighd. l2009h . and th e methods that are resistant to cor- 
rupted data (jPison et al 1 120031 : ICandes et all 120091: IXu et all . l2010h . It would be interesting to explore 
connections to this body of work. 



A Proofs 

We first prove a main lemma that will be used in the proof of Theorem [1] and Proposition [3] 
Lemma 1 Fixing V £ consider the optimization problem 

max logpfVrlE) - Atr(G) (11) 
s.t. S X =V-G. 

Let Gv be the solution to A' = 2X/N, anc! UDU T be an eigendecomposition of matrixV^ (Ssam — A'I)V5 
with U orthonormal. Then we have (V - Gy)" 1 = V-5ULU T V-5, where L is a diagonal matrix with 
entries L^ = max{Di.i, l},Vi = 1, . . . ,M. 

Proof We can rewrite (fTTj) as 

nfin - logdet (V - G) + tr((V - G) S SA m) + A'tr(G) 

s.t. G e Si 1 



13 



Now associate a Lagrange multiplier i~2 £ S* f with the G £ constraint and write down the Lagrangian 
as 

£(G, O) = - logdet (V - G) + tr((V - G) S S am) + A'tr(G) - tr(«G). 
Let fl* denote the dual solution. By KKT conditions we have: 

V G £ = (V - Gv)- 1 - S SAM + A'l - n* = (12) 

n f ,Gv £ Sf (13) 
tr(f2*G v ) = 0. (14) 

Recall that 

(V-Gvr 1 = (V^V' -G v ) _1 = V"5(I- V-3G v V-5)-iv-i (15) 
Plugging this into (fl"2l we get 

V _ '(I - V-'GyV-i^V-s = Ssam - A'l + £1* 

and so 

I-V-^GvV - ') 1 = V5 (Ssam- A'l + Sl*)\i (16) 

= V5(s sam -a'i)v= +v*n*vi 

By (|I3J|, both V"5G V V^ andVi«*V5 are in S* 1 . Let an eigendecomposition of V i G v V" i be AQA T 
for which A = [ai ... aju] is orthonormal and Q^i > 0, i = 1, . . . , M. Using (fT4|) and the fact that trace 
is invariant under cyclic permutations, we have 

= tr(JTGv) = tr(rj*V^V-2GvV-5V2) 

= tr ((V'fi*V5)(V-iG v V-5)) = tr (^(V^r2*V3)AQA T 

A/ 

- tr(A T (V^*V5)AQ) =5^Q ili a?(V'n*V')a < . 

i=l 

Since Q»,i > and V^fTVs" g §+, we can deduce 

a^V^'V^a, = if Q M > 0, V* = 1, . . ., M. 

Let I + = {i : Qj ; > 0}. Because V^iTV'J is positive semidcfinitc, for all zq € 1+ we also have 
V5f2*V^a; = 0. Furthermore, since 

(I - V^GyV^)- 1 = Adiag 



1-Qi,i' l-Q 



M.M 



multiplying both sides of (fTo) by a io leads to 



1 Qio,io 



= V5 (Ssam- A'l) V^, 



which shows aj is an eigenvector of V2 (Ssam — A'l) V*. Recall that V2 (S SAM — 

A'I)V5 = UDU T . 

Without loss of generality, we can take &i — U{ for all i £ I+. Indeed, since 1+ — {i : Qi.i > 0}, we have 

M 

AQA T = Qt.t^aJ = Y Qt ^ a 7 + ' a j a J = Y ( i " ,I U -' + ■ U J U J 

i=l iGl+ iGl+ j$X+ 

which implies we can further take A = U. This gives 

(I — V^'GyV^^ 1 = ULU T , (17) 
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where L is a diagonal matrix with entries = 1 _q , for i = 1, . . . , M. Plugging this into (|16[) we have 

ULU T = UDU T + V2fi*vi (18) 

For any io £ 1+ , multiplying the both sides of the above equation by Ui results in 

Li .i Ui = Di 0! i Ui + 

which implies Li .i = D^ 0i i , or more generally 

D M if Q M > 

1 4.1, • ) i=±, ■■■,M. 
1 otherwise 

Since Lj^ = 1 _q. > 1, to see Lj ^ = maxjD^ 1}, it remains to show Lj^ > D^j for all i. This follows by 
rearranging (|18| 

ulu t - udu t = v^rrvs ^ o 

L - D h Tj iti > D M , Vi = 1, . . . ,M. 
Finally, plugging (| 1 T[) into (fT5|) completes the proof. ■ 



Theorem 1 Ssam and S{jtm share the same trace and eigenvectors, and letting the eigenvalues of the two 
matrices, sorted in decreasing order, be denoted by si, . . . , sm and hi, . . . , Km, respectively, we have 

h m = max |s m - i j , /or m = 1, . . . , M. 

Proof Let Udiag(si, . . . , sm)U t be an eigendecomposition of Ssam such that U is orthonormal. Define 
V = v\, and note that an eigendecomposition of matrix 

_ 2A T \ _.i 2 A 

can be written as UDU T , where Dj j = u(sj — 2X/N), i = 1, . . , , M . By Lemma[T]we have 



■^UTM 



where L £ H)^ , and L^j = max{Dj i i,l} = max{-0(si — 2A/7V),1} = -Omaxjsi — 2X/N,l/v} = vhi,Vi = 
l,...,M. Therefore, 

E£ TM = 4uLU T = UHU T . 

where H = diag(/ii, . . . , Hm), as desired. 

Furthermore, recall that we impose no constraint on v when solving UTM, and as a result the partial 
derivative of the objective function with respect to v should vanish at v. That is, 



d_ 

dv 



(logp(*|E) - Atr(G)) | 6 _ = ~ (tr(SsAM) - tr ((«I - G)- 1 )) = 



^utm) 



which implies tr(S SA M) = tr - G)" 1 ) = tr (£{ 

Theorem 2 For all K, scalars l\ > £2 > • • • > > 0-2 > 0, p e (0,1), sequences N(m) suc -h that 
\M/N[M\ — p\ = o(l/ y/N^j^), covariance matrices si M ^ € with eigenvalues ,£k,& 2 , ■ ■ ■ ,& 2 , and 

i such that £i > (1 + \fp)o~ 2 , there exists ej € (0, 2a 2 /(£ t — er 2 )) swc/i i/iai ft* — > Sj — (2 + e l )pa 2 . 
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Proof Let ALA T be an eigendecomposition of £*, where A = [ai 
diag(^i, . . . , Ik , <j 2 ,. . . , a 2 ). Recall that 



ajf] is orthonormal and L = 



(K M 
3=1 j=K+l 
K M 

= ^(bj^f + a* £ ( b ^i) 2 - 

i=i j=K+i 



Using Theorem 4 in iPaull ()2007() , we have 



T \2 as ; 



(b/a 



1 - 



\ £i-a 2 



by Theorem 5 in Paul ( 2007 ) we have — a^ . This implies if 1 < j < K , j ^ i 



and for K < j < M, 



E ( b ^) 2 = 1 - 

j=K+l 



3=1 



Plugging (pQ]> . ([21) . and {22} into {HJ we arrive at 



(A 



1 + p(J 2/(|. _ a 2) 

(l-p)a 2 



4-1 



pa 



By Theorem 1.1 in lBaik and Silversteinl (|2006l ) we have 

n o ^2 



4 + 



1 



4 -a 2 



pa 



Finally, combining and JjH]) yields 



/l* )• Si - (2 + £i)/C)(T 2 



where = + j^zji- It is easy to see < e» < + ^z^r, 



as desired. 



(19) 



(20) 



Furthermore, decomposing into + , where b^ e span(ai, . . . ,slk) and b^ e span(aA'+i, ■ • ■ , ajt), 

a^. This implies if 1 < 

bjaj = bjaj ||biK% = (21) 

^ 1 - (b^) 2 . 



(22) 



^)/(^)^H'-^)/0 + i£0) 



(23) 



(24) 



Proposition 1 For any fixed N, K , scalars £i > £2 > ■ ■ ■ > £k > a 2 > 0, and any sequence of covariance 
matrices sl M ' G with eigenvalues £\, ... ,tjc, a 2 , . . . , a 2 , we have 

trSsAM a.s. 



tr£* 



1 



Pr 



trS 



SAM 



>e < 



1 

Ne 2 \M 



O 



\M 2 
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Proof Let ALA T be an eigendecomposition of X*, where A = [ai ... &m} is orthonormal and L 
diag(^i, ...,£ K ,a 2 ,.. . ,cr 2 ). Thus, tr£* = trL = £f =1 4 + (M - K)a 2 . Note that 

M M , N 



trSsAM = tr(AA T S SA M) = tr(A T £ S AMA) = a^SsAM^ = ^ (a?x (n 



2—1 n— 1 



Since X(„) <~ Af(0, £*), we can think of each X(„) as generated by X( n ) = Az( n ), where Z( n ) is sampled iid 
from W(0, L). This leads to 

M 1 JV M 



trS S AM = f;^X; 2i (n),i = E 



where u> 2 's are iid samples from x%- Therefore, 



,. trSsAM ,. 

km — — = hm 

m-j-oo trXL 



2 



i=l 



lim 



TV 



M 



M ^°° Ef=l 4 + (M - if )<7 2 M-voo 4 + ( M _ ^) CT 2 



Since the first terms in the denominator and the numerator are bounded and do not scale with K, we can 
drop them in the limit and rewrite 



Y^M 2 w i 

lim — — = hm 



~-K+\ u N 



lim 



1 



M 9 



M->cx> tr£* M-s-oo (M - if )<7 2 M->oo M - K N 

due to the strong law of large numbers and the fact that E[w 2 ] = N. 

To prove the second part of the proposition, note that E[tr£sAM] = trS* and 



£ i = i(w. P . i) 



Var[tr£ SA M] = Var 
since Var[u> 2 ] = 2N. By Chebyshev's inequality we have 



K 



M 



2.4^+ ^ a at 



TV ^— ' TV 

fe=l i=K+l 



K 



vfe=i 



Pr 



trS SA 



M 



trS* 



- 1 



Var 



>e < 



< 



trSs 



trS» 



Var[trS SA M] = 2 ELi *fc + ( M ~ 
e 2 (trS*) 2 We 2 """" 



(Ef=i4 + (M-tf)<r 



2 Ef=i(^-^) + M<7 4 



TVe 2 



(Ma 2 ) 5 



2 / 1 Etijin^) ' 

Ne 2 \ M M 2 a 4 



O 



Ne 2 \M \M 2 



Proposition 2 Suppose R* = diag(r, 1, 1, . . . , 1) and = [1 1 ... 1] T [1 1 ••• 1], r > 1. Lei 

f = [/i ... /m] T fre fie top eigenvector o/S*. TTien we Ziaue = f2(r),Vi > 1. 

Proof Note that / is the solution to the following optimization problem 

max u T (R* + F*)u 

u£l« 

s.t. ||u|| 2 = 1 

and the objective function can written be as rf 2 + J2fL 2 ff + ^Ef=i fij ■ By symmetry, we have f 2 = fz = 

■ ■ ■ = Jm- To simplify notation, let us represent f as [x y y ... y] T . Suppose the largest eigenvalue is 
q. By definition we have (R* + F*)f = qf, or equivalently 

(r + l)x + (M — l)y = qx 
x + My = qy. 



17 



Solving the above equations leads to 

q = - (M + r + 1 + v/(M + r + l) 2 -4(Afr+l)j = Q(r), 

and plugging this back to the above equations yields 

x/y = q - M = Q(r). 



Proposition 3 Suppose R* and F* are given as in Proposition^ and let the estimate resulting from i fTU)) 
be S = R* + F. Then for all X > we have 

1. rank(F) = 1 if and only if A < MiV/2. 

D. In £/ia£ case, if we rewrite F as ff T , where f = [/i ... /m] T , i/ien Vi > 1, fijfi is greater than 1 
and monotonically increasing with r. Furthermore, if A > (M — l)N/2, then f\jf% = Q(r). 

Proof Let A' = 2X/N, C = R~^ (£» - A'l) R7^, UDU T be an eigendecomposition of C, and L be the 
diagonal matrix with entries Lj ; j = maxjD^i, 1} ,Vi = 1, . . ., M. Applying Lemma [T] with Ssam = £* and 
V = Rj 1 we have 

t = R|ULU T R| = R?UIU T RJ + R?U(L - I)U T RJ = R* + R?U(L - I)U T RJ 

and therefore 

F = R?U(L-I)U T RJ. 

Since L< < = max{D^i, 1}, we further have rank(F) = rank(L — I) = | {i : D^j > 1} |. Recall that D denotes 
the eigenvalues of matrix C, which can be written as 

R^(S,-A'I)R^ = R~'£*R^ - A'R^ 1 

= R^ (R* + F»)R^ -A'll-fl- M e x ej 

= I + aa T - A'I + A' ^1 - -^j e ie J 

= (1 - A')I + aa T + A' (l - e ie J 



r i T 

where a = -±= 1 ... 1 and ei = [1 ... 0] T . Let A = aa T + A' (l - ±) e^. Since C = 

(1 — A') I + A, we know C and A share the same eigenvectors, and the corresponding eigenvalues differ 
by (1 — A'). Thus, the number of the eigenvalues of C that are greater than 1 is equal to the number 
of the eigenvalues of A that are greater than A'. However, rank(A) — rank(aa T + A' (l — eieJ) = 2, 
which implies A has only 2 non-zero eigenvalues. Let q be one of them. By symmetry, we can denote the 
corresponding eigenvector by u — [x y ... y] T . Then we have Au = qu, which leads to 

x (M-l)y A. A'\ 
- + 7 = J ^ L + I A )x = qx 



r wr \ r 



^- + {M-l)y = qy. (25) 
\ r 



After eliminating x and y we arrive at the following equation 

rq 2 - ((M - l)r + (r - 1)A' + l)q + X'(M - l)(r - 1) = 0. 
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Let s(q) denote the left-hand-side of the above equation. It is easy to see that its discriminant A > 0, 
and therefore the equation s(q) — has two distinct real roots, each corresponding to one of the non-zero 
eigenvalues of A. Recall that rank(F) equals the number of the eigenvalues of A that are greater than A', 
which is equal to the number of the roots of s(q) = that are greater than A'. Thus, rank(F) = 1 if and 
only if one root of s(q) = is greater than A' and the other is less than A'. This is equivalent to 

s(A') < ^ A' (A' — M) < <^> A' < M <^> A < 

To prove the second part of this proposition, let g+ be the greatest eigenvalue of A, or equivalently the 
greater root of s{q) = 0, and u + be the corresponding eigenvector. That is, 



q+ 2r 



i ((M - l)r + (r - 1)A' + 1 + Va) 
and u+ = [x+ y+ ... y+] T , where (x+,y+) is a solution of (x, y) in (|25|) given q = q+. Recall that 

F = RjU(L - I)U T R? = R| ((q + - A')u + u^) Rf , 

which leads to f = (q+ — A')2R* u+ = {q+ — A') 2 [y / rx+ y+ ... y+], and 

fx = Vrx ±=ri l _ M ^ Vf>L 

fi y+ 

It is easy to show lim A ,^ + q + = i + M — 1 and as a result liniv_ > o+ ^ = 1- Therefore, to show j- is greater 
than 1 and monotonically increasing with r, it is sufficient to show that 

dfy 



-#>0, Vr > 1,A' e (0,M). 



By straight forward algebra we have 



-A = (A' - M + 1) + -= ((A'r - A' - Mr + r)(A' - M + 1) + (M - 1 + A')) . 
dr VA 



Now consider three cases: 

1. < A < (M ~ 1)Af : In this case < A' < M - 1 and 

d&- 

-p->0 <s=> ((A'r — A' — Mr + r)(A' — M + 1) + (M — 1 + A')) > (M — 1 — X')VA 
dr 

-A'r + X' + Mr-r+ ^Lzl±^ ] ' > A. 

Expanding the both sides of the last inequality yields the desired result. 

2. A = ( M ~ 1 ) iV : j n t n i s case X = M — 1 and -Jf- = 2 ^y^ 1 ' > > 0, as desired. 

3. (M ~ 1)jv < A < ^ : In this case A'-M + le (0, 1), and we have 

(A'r - A' - Mr + r)(A' - M + 1) + (M - 1 + A') 
= r(A' - M + if - A'(A' - M + 1) + (A' + M - 1) 
> -A' + (A' + M-1) >0 

which implies > (A' — M + 1). Since the derivative is bounded below by a positive constant, we 
conclude j- = Q(r). 
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B Experiment Details 

B.l Coordinate Ascent Algorithm for STM 

Algorithm [3] describes our coordinate ascent method for solving STM. 

Algorithm 3 Procedure for solving STM 
Input: X, A 
Output: Sg TM 

T <-I. 
repeat 

S <- UTM(TA\A)_ 

T <- argmaxlogp(TA'|£), s.t. logdetT>0 

TeD M 

until converge 



B.2 Cross Validation for Synthetic Data Experiment 

For the synthetic data experiment, we select the regularization parameter via the following cross-validation 
procedure. Let 9 be the regularization parameter to be determined and U(X,9) be the learning algorithm 
that takes as input (X ', 9) and returns a covariance matrix estimate. We randomly split X into a partial 
training set X^ and a validation set Xy, whose sizes are roughly 70% and 30% of X, respectively. For each 
candidate value of 9, = U(X^,9) is computed and the likelihood p{Xy\Sl? r ) of the validation set Xy 
conditioned on the solution is evaluated. The value of 9 that maximizes this likelihood is then selected 
and fed into U(X,9) along with the full training set X, resulting in our estimate X 61 . 

In our synthetic data experiment, the K for URM/EM/MRH are selected from {0, 1, . . . , 15}, and the 
A for UTM/TM/STM are selected from {100, 120, . . . , 400}. These ranges are chosen so that the selected 
values rarely fall on the extremes. 



B.3 Termination Criteria for Iterative Algorithms 



In our implementation, the EM algorithm terminates when max 

rpNcw_rj,Cur 



Rv 



< 0.001. Similarly, the STM 



algorithm terminates when maxj 



nCurrcnt 



< 0.001. 



B.4 Equivalent Data Requirement 

Consider two learning algorithms U\ and U2, and N data samples X — {x^), . . . , X(jv)}- Denote the out-of- 
sample log-likelihood delivered by U with X by L(U,X). Suppose U2 generally has better performance than 
U\. Algorithm [4] evaluates the equivalent data requirement of U2 with respect to U\. In our implementation, 
we set step size a = 2% for uniform-residual experiment and a = 10% for nonuniform-residual ones. 



B.5 S&P500 Data Preprocessing 

Dehne November 2, 2001 as trading day 1 and August 9, 2007 as trading day 1451. After deleting 47 
constituent stocks that are not fully defined over this period, we compute for each stock the normalized log 
daily returns as follows: 

1. Let y[ a be the adjusted close price of stock i on day j, i = 1, . . . , 453 and j = 1, . . . , 1451 . 

2. Compute the raw log-daily-return of stock i on day j by 

^.=log^f±i, i = l,...,453, j = l,...,1450. 
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Algorithm 4 Procedure for evaluating equivalent data requirement 
Input: X,U\,U2 
Output: 7 

i <- 

while 1 do 

7 <— 1 — ia 

Xi <- {x(i), . . . , X( 7A r)} 
if L{U 2 ,Xi) < L(Ui,X) then 
if i > then 

7^7 + T^t&^&cT) a (interpolation) 
end if 
return 7 
end if 
i 4- i + 1 
end while 



3. Let y be the smallest number such that at least 99.5% of all y" ■ are less than or equal to y. Let y be 
the largest number such that at least 99.5% of all y'l/s are greater than or equal to y. Clip all y'/j by 
the interval [y, y]. 

j 50 

4. Let the volatility of stock i on day j > 50 be the 10- week rms <7jj — \ ^ v" 2 



i,j — f 

t=l 



5. Set y (n) 



T 

Vl,n + 50 ^453,n + 50 



"■l,n + 50 ti + 50 



for n = 1, 1400. 



B.6 Candidates for Regularization Parameter in Real Data Experiment 

In our real data experiment, the K for EM/MRH are selected from {0, 1, . . .,40}, and the A for TM/STM 
are selected from{200, 210, . . . , 600}. These ranges are chosen so that the selected values never fall on the 
extremes. 
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